17 research outputs found

    Position Heaps for Parameterized Strings

    Get PDF
    We propose a new indexing structure for parameterized strings, called the parameterized position heap. The parameterized position heap supports the parameterized pattern matching problem, in which a pattern matches a substring of the text if there exists a bijective mapping from the symbols of the pattern to the symbols of the substring. We give an online algorithm that constructs the parameterized position heap of a text in time linear in the text size. We also show that, using the parameterized position heap, all occurrences of a pattern in the text can be found in time linear in the product of the pattern size and the alphabet size.
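The bijective-mapping notion above can be made concrete with Baker's classic prev-encoding, under which two parameterized strings match exactly when their encodings are equal. The sketch below is a naive O(n·m) matcher built on that idea, not the position heap itself; the function names are illustrative.

```python
def prev_encode(s):
    """Encode a parameterized string: each symbol becomes the distance
    to its previous occurrence, or 0 for a first occurrence."""
    last = {}
    out = []
    for i, c in enumerate(s):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out

def p_match_positions(text, pattern):
    """Naive parameterized matching: report every position where some
    bijective renaming of the pattern's symbols yields the substring."""
    m = len(pattern)
    pp = prev_encode(pattern)
    return [i for i in range(len(text) - m + 1)
            if prev_encode(text[i:i + m]) == pp]
```

For example, the pattern "xy" parameterized-matches every length-2 substring of "abab", since any pair of distinct symbols can be renamed to (x, y).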

    Unsupervised spam detection based on string alienness measures

    No full text
    We propose an unsupervised method for detecting spam documents in Web page data, based on equivalence relations on strings. We introduce three measures that quantify the alienness (i.e., how different a class is from the others) of substring equivalence classes within a given set of strings. A document is then classified as spam if it contains a member of a characteristic equivalence class as a substring. The proposed method is unsupervised, language-independent, and very efficient. Computational experiments on data collected from Japanese web forums show fairly good results.
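The abstract does not spell out the equivalence relation or the three measures, so the following is only a hedged sketch of the general idea: group substrings by the set of positions at which they end (the right-extension equivalence familiar from suffix structures), and score each class with a hypothetical stand-in measure. All names and the scoring function are assumptions for illustration.

```python
from collections import defaultdict

def right_equiv_classes(strings, max_len=8):
    """Group substrings (up to max_len) by the set of positions where
    they end; substrings sharing exactly the same ending positions
    form one equivalence class."""
    ends = defaultdict(set)
    for d, s in enumerate(strings):
        for j in range(1, len(s) + 1):
            for i in range(max(0, j - max_len), j):
                ends[s[i:j]].add((d, j))
    classes = defaultdict(list)
    for sub, pos in ends.items():
        classes[frozenset(pos)].append(sub)
    return classes

def alienness(members):
    # hypothetical stand-in for the paper's measures: score a class by
    # the number of distinct substrings it contains
    return len(members)
```

In "abcabc", the substrings "abc", "bc", and "c" all end at exactly the same two positions, so they fall into one class of size 3.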

    Filtering Multi-set Tree: Data Structure for Flexible Matching Using Multi-track Data

    Get PDF
    Special Section: Nowcast and Forecast of Road Traffic by Data Fusion of Various Sensing Data

    Detecting blog spams using the vocabulary size of all substrings in their copies

    No full text
    This paper addresses the problem of detecting blog spams, which are unsolicited messages posted as blog entries. Unlike a spam mail, a typical blog spam is produced to inflate the PageRank of the spammer's Web sites, so many copies of the spam are posted and all of them contain URLs of those sites. The number of copies, which we call the frequency, therefore seems to be a good key for finding this type of blog spam. The frequency alone is not, however, sufficient for detection algorithms that flag an entry as spam when its frequency exceeds some threshold, for the following reason: it is very difficult to collect Web pages including all copies of a blog entry, so the input data may contain only a few copies of an entry, possibly fewer than the predefined threshold, and thus a frequency-based spam detection algorithm
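The abstract critiques a naive frequency-threshold baseline; a minimal sketch of that baseline (not the paper's own method, which the truncated abstract does not describe) could look like this, with all names assumed for illustration:

```python
from collections import Counter

def frequency_spam_filter(entries, threshold=3):
    """Naive baseline: flag an entry as spam when the number of
    identical copies in the crawled data reaches the threshold."""
    counts = Counter(entries)
    return {e for e, c in counts.items() if c >= threshold}
```

The limitation the abstract points out follows directly: if the crawl captures only two copies of a spam that was posted hundreds of times, its observed frequency stays below the threshold and the filter misses it.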